Searching Web Data: an Entity Retrieval Model
Abstract
More and more (semi-)structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large-scale systems providing effective means of searching and retrieving this semi-structured information, with the ultimate goal of making it exploitable by humans and machines alike. This dissertation examines the shift from the traditional web document model to a web data object (entity) model and studies the challenges and issues faced in implementing a scalable and high-performance system for searching semi-structured data objects on a large, heterogeneous and decentralised infrastructure. Towards this goal, we define an entity retrieval model, develop novel methodologies for supporting this model, and design a web-scale retrieval system around this model. In particular, this dissertation focuses on the following four main aspects of the system: reasoning, ranking, indexing and querying. We introduce a distributed reasoning framework which is tolerant of low data quality. We present a link analysis approach for computing the popularity score of data objects among decentralised data sources. We propose an indexing methodology for semi-structured data which offers a good compromise between query expressiveness, query processing and index maintenance compared to other approaches. Finally, we develop an index compression technique which increases both the update and query throughput of the system. The resulting system can index billions of data objects and provides keyword-based as well as more advanced search interfaces for retrieving the most relevant data objects. This work has been part of the Sindice search engine project at the Digital Enterprise Research Institute (DERI), NUI Galway.
The Sindice system currently maintains more than 100 million pages downloaded from the Web and is actively used by many researchers within and outside of DERI. The reasoning, ranking, indexing and querying components of the Sindice search engine are a direct result of this dissertation research.
Similar resources
Searching web data: An entity retrieval and high-performance indexing model
More and more (semi) structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large scale systems providing effective means of searching and retrieving this semi-structured information with ...
Searching Web 2.0 Data Through Entity-Based Aggregation
Entity-based searching has been introduced as a way of allowing users and applications to retrieve information about a specific real world object such as a person, an event, or a location. Recent advances in crawling, information extraction, and data exchange technologies have brought a new era in data management, typically referred to through the term Web 2.0. Entity searching over Web 2.0 dat...
SIREn: Entity Retrieval System for the Web of Data
We present ongoing work on the Semantic Information Retrieval Engine (SIREn), an “entity retrieval system” specifically designed to meet the requirements of indexing and searching a large amount of semi-structured data, e.g. the entire Web of Data. SIREn supports efficient full text search with semi-structural queries and exhibits a concise index, constant time updates and inherits Information ...
Searching for Entities When Retrieval Meets Extraction
Retrieving entities inside documents instead of documents or web pages themselves has become an active topic in both commercial search systems and academic information retrieval research. Our method of entity retrieval is based on a two-layer retrieval and extraction probability model (TREPM) for integrating document retrieval and entity extraction together. The document retrieval layer finds s...
Behavioral Considerations in Developing Web Information Systems: User-centered Design Agenda
The current paper explores designing a web information retrieval system regarding the searching behavior of users in real and everyday life. Designing an information system that is closely linked to human behavior is equally important for providers and the end users. From an Information Science point of view, four approaches in designing information retrieval systems were identified as system-...
Publication date: 2011